(54) Source normalization training for modeling of speech

(57) A maximum likelihood (ML) linear regression (LR) solution to environment normalization is provided
where the environment is modeled as a hidden (non-observable) variable. By application of an expectation
maximization algorithm and extension of the Baum-Welch forward and backward variables (Steps 23a-23d), a
source normalization is achieved such that it is not necessary to label a database in terms of environment,
such as speaker identity, channel, microphone and noise type.

Description

[0001] This invention relates to training for Hidden Markov Model (HMM) modeling of speech and more particularly to
removing environmental factors from the speech signal during the training procedure.

[0002] In the present application we refer to speaker, handset or microphone, transmission channel, noise
background conditions, or a combination of these as the environment. A speech signal can only be measured in a
particular environment. Speech recognizers suffer from environment variability for two reasons: trained model
distributions may be biased from testing signal distributions because of environment mismatch, and trained model
distributions are flat because they are averaged over different environments.

[0003] The first problem, the environmental mismatch, can be reduced through model adaptation, based on some 
utterances collected in the testing environment. To solve the second problem, the environmental factors should be 
removed from the speech signal during the training procedure, mainly by source normalization. 
[0004] In the direction of source normalization, speaker adaptive training uses linear regression (LR) solutions to
decrease inter-speaker variability. See, for example, T. Anastasakos, et al., "A compact model for speaker-adaptive
training," International Conference on Spoken Language Processing, Vol. 2, October 1996. Another technique
models mean vectors as the sum of a speaker-independent bias and a speaker-dependent vector. This is found in
A. Acero, et al., "Speaker and Gender Normalization for Continuous-Density Hidden Markov Models," in Proc. of
IEEE International Conference on Acoustics, Speech and Signal Processing, pages 342-345, Atlanta, 1996. Both of
these techniques require explicit labels of the classes, for example the speaker or gender of the utterance, during
training. Therefore, they cannot be used to train clusters of classes, which represent acoustically close speakers,
handsets or microphones, or background noises. Such inability to discover clusters may be a disadvantage in an
application.
[0005] An illustrative embodiment of the present invention seeks to provide a method for source normalization training
for HMM modeling of speech that avoids or minimizes the above-mentioned problems.

[0006] Aspects of the invention are specified in the claims. In carrying out principles of the present invention, a method
provides a maximum likelihood (ML) linear regression (LR) solution to the environment normalization problem, where
the environment is modeled as a hidden (non-observable) variable. An EM-based training algorithm can generate
optimal clusters of environments and therefore it is not necessary to label a database in terms of environment. For
special cases, the technique is compared to the utterance-by-utterance cepstral mean normalization (CMN) technique
and shows performance improvement on a noisy speech telephone database.

[0007] In accordance with another feature of the present invention, under the maximum-likelihood (ML) criterion, by
application of the EM algorithm and extension of the Baum-Welch forward and backward variables and algorithm, a
joint solution to the parameters for the source normalization is obtained, i.e. the canonical distributions, the
transformations and the biases.

[0008] For a better understanding of the present invention, reference will now be made, by way of example, to the 
accompanying drawings, in which: 

Fig. 1 is a block diagram of a system which includes aspects of the present invention; 

Fig. 2 illustrates a speech model; 

Fig. 3 illustrates a Gaussian distribution; 

Fig. 4 illustrates distortions in the distribution caused by different environments; 

Fig. 5 is a more detailed flow diagram of the process according to one embodiment of the present invention; and

Fig. 6 is a recognizer according to an embodiment of the present invention using a source normalization model.

[0009] The training is done on a computer workstation having a monitor 11, a computer workstation 13, a keyboard
15, and a mouse or other interactive device 15a, as shown in Fig. 1. The system may be connected to a separate
database, represented by database 17 in Fig. 1, for storage and retrieval of models.

[0010] By the term "training" we mean herein to fix the parameters of the speech models according to an optimum
criterion. In this particular case, we use HMM (Hidden Markov Model) models. These models are as represented in
Fig. 2 with states A, B, and C and transitions E, F, G, H, I and J between states. Each of these states has a mixture of
Gaussian distributions 18 represented by Fig. 3. We are training these models to account for different environments.
By environment we mean different speaker, handset, transmission channel, and noise background conditions. Speech
recognizers suffer from environment variability because trained model distributions may be biased from testing signal
distributions because of environment mismatch and trained model distributions are flat because they are averaged over
different environments. For the first problem, the environmental mismatch can be reduced through model adaptation,
based on utterances collected in the testing environment. Applicant's teaching herein is to solve the second problem by
removing the environmental factors from the speech signal during the training procedure. This is source normalization
training according to the present invention. A maximum likelihood (ML) linear regression (LR) solution to the
environmental problem is provided herein where the environment is modeled as a hidden (non-observable) variable.

[0011] A clean speech pattern distribution 40 will undergo complex distortion with different environments as shown in
Fig. 4. The two axes represent two parameters which may be, for example, frequency, energy, formant, spectral, or
cepstral components. Fig. 4 illustrates a change at 41 in the distribution due to background noise or a change in
speakers. The purpose of the application is to model the distortion.

[0012] The present model assumes the following: 1) the speech signal $x$ is generated by a Continuous Density Hidden
Markov Model (CDHMM), called source distributions; 2) before being observed, the signal has undergone an
environmental transformation, drawn from a set of transformations, where $W_{je}$ is the transformation on the HMM
state $j$ of the environment $e$; 3) such a transformation is linear, and is independent of the mixture components of
the source; and 4) there is a bias vector $b_{ke}$ at the k-th mixture component due to environment $e$.

[0013] What we observe at time $t$ is:

$$o_t = W_{je}\, x_t + b_{ke} \tag{1}$$

[0014] Our problem now is to find, in the maximum likelihood (ML) sense, the optimal source distributions, the
transformation and the bias set.
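The observation model of equation 1 can be illustrated with a short numerical sketch. This is only an illustration under stated assumptions: the function name `observe`, the feature dimension and all numerical values are hypothetical and do not come from the embodiment.

```python
import numpy as np

def observe(x, W_je, b_ke):
    """Apply equation (1), o_t = W_je x_t + b_ke, to every frame (row) of x."""
    return x @ W_je.T + b_ke

rng = np.random.default_rng(0)
D = 3                                              # hypothetical feature dimension
x = rng.normal(size=(5, D))                        # hypothetical source frames x_t
W_je = np.eye(D) + 0.1 * rng.normal(size=(D, D))   # hypothetical environment transformation
b_ke = np.array([0.5, -0.2, 0.0])                  # hypothetical environment bias
o = observe(x, W_je, b_ke)                         # frames as they would be observed
```

With an identity transformation and a zero bias the observation equals the source frame, which corresponds to the case of no environmental distortion.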

[0015] In the prior art (A. Acero, et al. cited above and T. Anastasakos, et al. cited above), the environment e must be 
explicit, e.g.: speaker identity, male/female. An aspect of the present invention overcomes this limitation by allowing an 
arbitrary number of environments which are optimally trained. 

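[0016] Let $N$ be the number of HMM states, $M$ be the mixture number, $L$ be the number of environments,
$\Omega_s \triangleq \{1, 2, \ldots, N\}$ be the set of states, $\Omega_m \triangleq \{1, 2, \ldots, M\}$ be the set of
mixture indicators, and $\Omega_e \triangleq \{1, 2, \ldots, L\}$ be the set of environmental indicators.

[0017] For an observed speech sequence of $T$ vectors: $O \triangleq o_1^T \triangleq (o_1, o_2, \ldots, o_T)$, we
introduce state sequence $\Theta \triangleq (\theta_0, \ldots, \theta_T)$ where $\theta_t \in \Omega_s$, mixture indicator
sequence $\Xi \triangleq (\xi_1, \ldots, \xi_T)$ where $\xi_t \in \Omega_m$, and environment indicator sequence
$\Phi \triangleq (\varphi_1, \ldots, \varphi_T)$ where $\varphi_t \in \Omega_e$. They are all unobservable. Under some
additional assumptions, the joint probability of $O$, $\Theta$, $\Xi$, and $\Phi$ given model $\lambda$ can be written as:

$$p(O, \Theta, \Xi, \Phi \mid \lambda) = u_{\theta_0} \prod_{t=1}^{T} a_{\theta_{t-1}\theta_t}\, c_{\theta_t \xi_t}\, b_{\theta_t \xi_t \varphi}(o_t)\; l_{\varphi} \tag{2}$$

where

[Equations (3) to (5) are not legible in the source.]

$$c_{jk} \triangleq p(\xi_t = k \mid \theta_t = j, \lambda), \qquad l_e \triangleq p(\varphi = e \mid \lambda) \tag{6}$$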



[0018] Referring to Fig. 1, the workstation 13 including a processor contains a program as illustrated that starts with
an initial standard HMM model 21 which is to be refined by estimation procedures using Baum-Welch or
Expectation-Maximization procedures 23 to get new models 25. The program gets training data at database 19 under
different environments and this is used in an iterative process to get optimal parameters. From this model we get
another model 25 that takes into account environment changes. The quantities are defined by probabilities of observing
a particular input vector at some particular state for a particular environment given the model.
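The probability of observing an input vector at a given state, mixture component and environment could be evaluated along the following lines. This is a sketch under assumptions that are not stated at this point in the description: the source mean of the mixture component, here called `mu_jk`, is mapped by equation 1 to `W_je @ mu_jk + b_ke`, and a diagonal covariance `var_jk` is used directly in the observed space. The function name `log_b_jke` and all argument names are hypothetical.

```python
import numpy as np

def log_b_jke(o_t, mu_jk, var_jk, W_je, b_ke):
    """Log-density of frame o_t for state j, mixture k and environment e.

    Assumption: Gaussian with mean W_je mu_jk + b_ke and diagonal
    covariance var_jk (a vector of per-dimension variances).
    """
    mean = W_je @ mu_jk + b_ke
    diff = o_t - mean
    return -0.5 * (np.sum(np.log(2.0 * np.pi * var_jk)) + np.sum(diff ** 2 / var_jk))
```

Working in the log domain is a design choice of this sketch; it avoids numerical underflow when many frame probabilities are multiplied along an utterance.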
[0019] The model parameters can be determined by applying a generalized EM procedure with three types of hidden
variables: state sequence, mixture component indicators, and environment indicators (A. P. Dempster, N. M. Laird, and
D. B. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm," Journal of the Royal Statistical
Society, 39(1): 1-38, 1977). For this purpose, Applicant teaches the CDHMM formulation from B. Juang,
"Maximum-Likelihood Estimation for Mixture Multivariate Stochastic Observation of Markov Chains" (The Bell System
Technical Journal, pages 1235-1248, July-August 1985) to be extended to result in the following paragraphs. Denote:

$$\alpha_t(j,e) \triangleq p(o_1^t, \theta_t = j, \varphi = e \mid \lambda) \tag{7}$$

$$\beta_t(j,e) \triangleq p(o_{t+1}^T \mid \theta_t = j, \varphi = e, \lambda) \tag{8}$$

$$\gamma_t(j,k,e) \triangleq p(\theta_t = j, \xi_t = k, \varphi = e \mid O, \lambda) \tag{9}$$

[0020] The speech is observed as a sequence of frames (vectors). Equations 7, 8, and 9 are estimations of intermediate
quantities. For example, equation 7 is the joint probability of observing the frames from times 1 to t, being in state j
at time t, and being in environment e, given the model $\lambda$.

[0021] The following re-estimation equations can be derived from equations 2, 7, 8, and 9.

[0022] For the EM procedure 23, equations 10-21 are solutions for the quantities in the model.

Initial state probability:

[Equation (10) is not legible in the source.]

[0023] with R the number of training tokens.

Transition probability:

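[Equation (11) is not legible in the source.]

Mixture component probability: (Mixture probability is where there is a mixture of Gaussian distributions)

[Equation (12) is not legible in the source.]

Environment probability:

[Equation (13) is not legible in the source.]

Mean vector and bias vector: We introduce:

$$\rho(j,k,e) \triangleq \sum_{r=1}^{R} \sum_{t=1}^{T^r} \gamma_t^r(j,k,e)\, o_t^r \tag{14}$$

$$g(j,k,e) \triangleq \sum_{r=1}^{R} \sum_{t=1}^{T^r} \gamma_t^r(j,k,e) \tag{15}$$

[0027] and

$$G_{ke} \triangleq \sum_{j=1}^{N} g(j,k,e)\, \Sigma_{jk}^{-1} \tag{16}$$

$$E_{jke} \triangleq g(j,k,e)\, W_{je}'\, \Sigma_{jk}^{-1} \tag{17}$$

$$F_{jk} \triangleq \sum_{e=1}^{L} E_{jke}\, W_{je} \tag{18}$$

$$a_{jk} \triangleq \sum_{e=1}^{L} W_{je}'\, \Sigma_{jk}^{-1}\, \rho(j,k,e) \tag{19}$$

$$c_{ke} \triangleq \sum_{j=1}^{N} \Sigma_{jk}^{-1}\, \rho(j,k,e) \tag{20}$$

Assuming [two expressions not legible in the source], for a given $k$, we have $N+L$ equations:

$$\sum_{e=1}^{L} E_{jke}\, b_{ke} + F_{jk}\, \mu_{jk} = a_{jk}, \quad \forall j \in \Omega_s \tag{21}$$

$$G_{ke}\, b_{ke} + \sum_{j=1}^{N} E_{jke}'\, \mu_{jk} = c_{ke}, \quad \forall e \in \Omega_e \tag{22}$$

These equations 21 and 22 are solved jointly for mean vectors and bias vectors.

[0028] Therefore $\mu_{jk}$ and $b_{ke}$ can be simultaneously obtained by solving the linear system of $N+L$ variables.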
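The joint solution for the mean vectors and bias vectors can be sketched numerically as follows. This is an illustration only, under stated assumptions: the function name and array layouts are hypothetical, the accumulators `g` and `rho` are the per-mixture sums over frames of the occupancy and of the occupancy-weighted observations, and the block matrices are formed as $g\,W'\Sigma^{-1}$, $g\,W'\Sigma^{-1}W$ and $g\,\Sigma^{-1}$, which is what setting the derivatives of the Gaussian log-likelihood to zero yields. A least-squares solver is used because the system may be rank-deficient (for example, when a constant shift can be traded between means and biases).

```python
import numpy as np

def solve_means_and_biases(g, rho, W, inv_cov):
    """Jointly solve for mu_jk (j = 1..N) and b_ke (e = 1..L) for one mixture k.

    g: (N, L) occupancies, rho: (N, L, D) weighted observation sums,
    W: (N, L, D, D) transformations, inv_cov: (N, D, D) inverse covariances.
    """
    N, L, D = rho.shape
    A = np.zeros(((N + L) * D, (N + L) * D))
    rhs = np.zeros((N + L) * D)
    for j in range(N):
        rj = slice(j * D, (j + 1) * D)                  # block of unknown mu_jk
        for e in range(L):
            re = slice((N + e) * D, (N + e + 1) * D)    # block of unknown b_ke
            E = g[j, e] * W[j, e].T @ inv_cov[j]        # g W' Sigma^-1
            A[rj, rj] += E @ W[j, e]                    # coefficient of mu_jk, row j
            A[rj, re] += E                              # coefficient of b_ke, row j
            A[re, rj] += E.T                            # coefficient of mu_jk, row e
            A[re, re] += g[j, e] * inv_cov[j]           # coefficient of b_ke, row e
            rhs[rj] += W[j, e].T @ inv_cov[j] @ rho[j, e]
            rhs[re] += inv_cov[j] @ rho[j, e]
    z, *_ = np.linalg.lstsq(A, rhs, rcond=None)         # tolerant of rank deficiency
    return z[: N * D].reshape(N, D), z[N * D:].reshape(L, D)
```

Each state contributes one block row and each environment contributes one block row, giving the N+L vector equations for a given mixture component.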



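Covariance:

$$\Sigma_{jk} = \frac{\sum_{r=1}^{R} \sum_{t=1}^{T^r} \sum_{e=1}^{L} \gamma_t^r(j,k,e)\, \delta_t^r(j,k,e)\, \delta_t^r(j,k,e)'}{\sum_{r=1}^{R} \sum_{t=1}^{T^r} \sum_{e=1}^{L} \gamma_t^r(j,k,e)} \tag{23}$$

[0029] where $\delta_t^r(j,k,e) \triangleq o_t^r - W_{je}\, \mu_{jk} - b_{ke}$.

Transformation: We assume covariance matrix to be diagonal: $\Sigma_{jk}^{(m,n)} = 0$ if $n \neq m$. For the line $m$
of transformation $W_{je}^{(m)}$, we can derive (see for example C. J. Leggetter, et al., "Maximum Likelihood Linear
Regression for Speaker Adaptation of Continuous Density HMMs," Computer, Speech and Language, 9(2): 171-185,
1995):

$$Z_{je}^{(m)} = W_{je}^{(m)}\, R_{je}^{(m)} \tag{24}$$

which is a linear system of D equations, where:

$$Z_{je}^{(m)} \triangleq \sum_{k=1}^{M} \frac{1}{\Sigma_{jk}^{(m,m)}} \sum_{r=1}^{R} \sum_{t=1}^{T^r} \gamma_t^r(j,k,e)\, (o_t^r - b_{ke})^{(m)}\, \mu_{jk}' \tag{25}$$

$$R_{je}^{(m)}(p,q) \triangleq \sum_{k=1}^{M} \frac{1}{\Sigma_{jk}^{(m,m)}}\, \mu_{jk}^{(p)}\, \mu_{jk}^{(q)} \sum_{r=1}^{R} \sum_{t=1}^{T^r} \gamma_t^r(j,k,e) \tag{26}$$

If the means of the source distributions $\mu_{jk}$ are held constant, the above set of source normalization formulas
can also be used for model adaptation.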
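The row-by-row solution of the transformation can be sketched as follows for one state and one environment. This is an illustration under stated assumptions: diagonal covariances, accumulators `g` (occupancy per mixture component) and `s` (occupancy-weighted sum of bias-compensated observations per mixture component), and enough linearly independent mixture means for each row system to be invertible. The function and argument names are hypothetical.

```python
import numpy as np

def estimate_transform(g, s, mu, var):
    """Estimate a D x D transformation one row at a time.

    g: (K,) occupancies, s: (K, D) sums of gamma * (o_t - b_ke),
    mu: (K, D) source means, var: (K, D) diagonal variances.
    """
    K, D = mu.shape
    W = np.zeros((D, D))
    for m in range(D):
        inv_var_m = 1.0 / var[:, m]                    # 1 / Sigma^(m,m) per mixture
        Z = (inv_var_m * s[:, m]) @ mu                 # right-hand side, shape (D,)
        R = (mu * (inv_var_m * g)[:, None]).T @ mu     # symmetric D x D matrix
        W[m] = np.linalg.solve(R, Z)                   # row m: W^(m) R = Z
    return W
```

Because the covariance is diagonal, each row of the transformation decouples from the others and only a D x D system has to be solved per row.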

[0030] The model is specified by the parameters. The new model is specified by the new parameters.

[0031] As illustrated in Figs. 1 and 5, we start with an initial standard model 21, such as the CDHMM model, with
initial values. The next step is the Expectation-Maximization procedure 23, starting with (Step 23a) equations 7-9 and
re-estimation (Step 23b) equations 10-13 for initial state probability, transition probability, mixture component
probability and environment probability.

[0032] The next step (23c) is to derive a mean vector and a bias vector by introducing two additional equations 14
and 15 and equations 16-20. The next step (23d) is to apply linear equations 21 and 22, solve them jointly for
mean vectors and bias vectors, and at the same time calculate the variance using equation 23. Equation 24, which
is a system of linear equations, is then solved for the transformation parameters using the quantities given by
equations 25 and 26. At this point all the model parameters have been solved for. The old model parameters are then
replaced by the newly calculated ones (Step 24). The process is repeated for all the frames. When this is done for all
the frames of the database, a new model is formed, and the new models are then re-evaluated using the same
equations until there is no change beyond a predetermined threshold (Step 27).
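The outer iteration described above (re-estimate, replace the old parameters, repeat until the change falls below a threshold) can be sketched as a simple loop. This is a control-flow illustration only: `em_step` stands in for Steps 23a-23d and is not implemented here, the convergence measure (largest absolute parameter change) is an assumption, and `toy_em_step` is a hypothetical stand-in used solely to exercise the loop.

```python
import numpy as np

def train_until_converged(params, em_step, threshold=1e-4, max_iter=100):
    """Repeat em_step until no parameter changes by more than threshold."""
    iteration = 0
    for iteration in range(1, max_iter + 1):
        new_params = em_step(params)                 # stands in for Steps 23a-23d
        change = max(float(np.max(np.abs(new_params[name] - params[name])))
                     for name in params)
        params = new_params                          # Step 24: replace old parameters
        if change <= threshold:                      # Step 27: compare with threshold
            break
    return params, iteration

def toy_em_step(params):
    # Hypothetical update that moves halfway towards a fixed point at 1.0.
    return {"mu": 0.5 * (params["mu"] + 1.0)}

final_params, n_iter = train_until_converged({"mu": np.zeros(2)}, toy_em_step)
```

The `max_iter` guard is an addition of this sketch so that the loop always terminates even if the threshold is never reached.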

[0033] After a source normalization training model is formed, this model is used in a recognizer as shown in Fig. 6,
where input speech is applied to a recognizer 60 which uses the source normalized HMM model 61 created by the
above training to achieve the response.

[0034] The recognition task has 53 commands of 1-4 words ("call return", "cancel call return", "selective call
forwarding", etc.). Utterances are recorded through telephone lines, with a diversity of microphones, including carbon,
electret and cordless microphones and hands-free speaker-phones. Some of the training utterances do not correspond
to their transcriptions. For example: "call screen" (cancel call screen), "matic call back" (automatic call back),
"call tra" (call tracking).

[0035] The speech is sampled at 8 kHz with a 20 ms frame rate. The observation vectors are composed of LPCC (Linear
Prediction Coding Coefficients) derived 13-MFCC (Mel-Scale Cepstral Coefficients) plus regression-based delta
MFCC. CMN is performed at the utterance level. There are 3505 utterances for training and 720 for
speaker-independent testing. The number of utterances per call ranges between 5 and 30.
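Utterance-level CMN amounts to subtracting, for each utterance, the mean of its cepstral vectors. A minimal sketch follows; the function name `cmn` and the example feature dimension are hypothetical.

```python
import numpy as np

def cmn(features):
    """Utterance-level cepstral mean normalization.

    features: array of shape (num_frames, num_coefficients) for one utterance.
    """
    return features - features.mean(axis=0, keepdims=True)
```

A constant offset added to every frame of an utterance, such as a fixed channel effect in the cepstral domain, is removed by this operation, while distortions that vary from frame to frame are not.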

[0036] Because of data sparseness, besides transformation sharing among states and mixtures, the transformations
need to be shared by a group of phonetically similar phones. The grouping, based on a hierarchical clustering of
phones, is dependent on the amount of training (SN) or adaptation (AD) data, i.e., the larger the number of tokens is,
the larger the number of transformations. Recognition experiments are run on several system configurations:

• BASELINE applies CMN utterance-by-utterance. This simple technique will remove channel and some long-term
speaker specificities, if the duration of the utterance is long enough, but cannot deal with time-domain additive
noises.

• SN performs source-normalized HMM training, where the utterances of a phone call are assumed to have been
generated by a call-dependent acoustic source. Speaker, channel and background noise that are specific to the call
are then removed by MLLR. An HMM recognizer is then applied using source parameters. We evaluated a special
case, where each call is modeled by one environment.

• AD adapts traditional HMM parameters by unsupervised MLLR: 1. Using current HMMs and task grammar to
phonetically recognize the test utterances; 2. Mapping the phone labels to a small number (N) of classes, which
depends on the amount of data in the test utterances; 3. Estimating the LR using the N classes and associated test
data; 4. Recognizing the test utterances with the transformed HMM. A similar procedure has been introduced in
C. J. Leggetter and P. C. Woodland, "Maximum likelihood linear regression for speaker adaptation of continuous
density HMMs," Computer, Speech and Language, 9(2): 171-185, 1995.

• SN+AD refers to AD with initial models trained by the SN technique.

Based on the results summarized in Table 1, we point out:

• For numbers of mixture components per state smaller than 16, SN, AD, and SN+AD all give consistent
improvement over the baseline configuration.

• For numbers of mixture components per state smaller than 16, SN gives about 10% error reduction over the
baseline. As SN is a training procedure which does not require any change to the recognizer, this error reduction
mechanism immediately benefits applications.

• For all tested configurations, AD using acoustic models trained with the SN procedure always gives additional error
reduction.

• The most efficient case of SN+AD is with 32 components per state, which reduces error rate by 23%, giving
4.64% Word Error Rate (WER) on the task.





Configuration      4       8       16      32
Baseline          7.85    6.94    6.83    5.98
SN                7.53    6.35    6.51    6.03
AD                7.15    6.41    5.61    5.87
SN+AD             6.99    6.03    5.41    4.64

Table 1: Word error rate (%) as function of test configuration
and number of mixture components per state.
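Relative error reductions can be derived from the entries of Table 1 as sketched below. The word error rates are copied from the table; the function name `relative_reduction` and the choice of reference (by default the baseline with the same number of mixture components) are assumptions of this sketch, since the reference behind the percentages quoted in the text is not stated.

```python
# Word error rates (%) from Table 1, indexed by configuration and
# number of mixture components per state.
WER = {
    "Baseline": {4: 7.85, 8: 6.94, 16: 6.83, 32: 5.98},
    "SN":       {4: 7.53, 8: 6.35, 16: 6.51, 32: 6.03},
    "AD":       {4: 7.15, 8: 6.41, 16: 5.61, 32: 5.87},
    "SN+AD":    {4: 6.99, 8: 6.03, 16: 5.41, 32: 4.64},
}

def relative_reduction(system, mixtures, reference="Baseline"):
    """Relative WER reduction (%) of `system` versus `reference` at the same size."""
    ref = WER[reference][mixtures]
    return 100.0 * (ref - WER[system][mixtures]) / ref
```

For example, `relative_reduction("SN+AD", 32)` compares 4.64% against the 5.98% baseline at 32 components, and `relative_reduction("SN+AD", 32, reference="SN")` compares it against SN alone.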



[0037] Although the present invention and its advantages have been described in detail, it should be understood that 
various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the 
invention as defined by the appended claims. 

Claims 

1. A method of source normalization training for HMM modeling of speech comprising the steps of:

(a) providing an initial model;

(b) on said initial model or following new models performing the following steps to get a new model:

b1) estimation of intermediate quantities;
b2) performing re-estimation to determine initial state probability, transition probability, mixture component
probability and environment probability;
b3) deriving mean vector and bias vector;
b4) solving jointly for mean vector and bias vector using linear equations and determining variances and
transformation; and
b5) replacing old model parameters for the calculated ones; and

(c) determining after a new model is formed if it differs significantly from the previous model and if so repeating
steps b1 - b5.

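2. The method of Claim 1 wherein in step b1 estimation intermediate quantities is determined by

$$\alpha_t(j,e) \triangleq p(o_1^t, \theta_t = j, \varphi = e \mid \lambda),$$

$$\beta_t(j,e) \triangleq p(o_{t+1}^T \mid \theta_t = j, \varphi = e, \lambda),$$

and

$$\gamma_t(j,k,e) \triangleq p(\theta_t = j, \xi_t = k, \varphi = e \mid O, \lambda).$$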



3. The method of Claim 2 wherein in step b2 the initial state probability is determined by

[equation not legible in the source]

transition probability is determined by

[equation not legible in the source]

mixture component probability is determined by

[equation not legible in the source]

and environment probability is determined by

[equation not legible in the source]

4. The method according to any of Claims 2 to 3 wherein step b3 deriving mean vector and bias vector is determined
by

[equations not legible in the source]

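5. The method according to any of Claims 2 to 4 wherein step b4 equations

$$\sum_{e=1}^{L} E_{jke}\, b_{ke} + F_{jk}\, \mu_{jk} = a_{jk}, \quad \forall j \in \Omega_s, \quad \text{and} \quad G_{ke}\, b_{ke} + \sum_{j=1}^{N} E_{jke}'\, \mu_{jk} = c_{ke}, \quad \forall e \in \Omega_e,$$

are used for solving jointly and equation

$$\Sigma_{jk} = \frac{\sum_{r=1}^{R} \sum_{t=1}^{T^r} \sum_{e=1}^{L} \gamma_t^r(j,k,e)\, \delta_t^r(j,k,e)\, \delta_t^r(j,k,e)'}{\sum_{r=1}^{R} \sum_{t=1}^{T^r} \sum_{e=1}^{L} \gamma_t^r(j,k,e)}$$

is used to determine variance and equations

$$Z_{je}^{(m)} = W_{je}^{(m)}\, R_{je}^{(m)},$$

$$Z_{je}^{(m)} \triangleq \sum_{k=1}^{M} \frac{1}{\Sigma_{jk}^{(m,m)}} \sum_{r=1}^{R} \sum_{t=1}^{T^r} \gamma_t^r(j,k,e)\, (o_t^r - b_{ke})^{(m)}\, \mu_{jk}',$$

and

$$R_{je}^{(m)}(p,q) \triangleq \sum_{k=1}^{M} \frac{1}{\Sigma_{jk}^{(m,m)}}\, \mu_{jk}^{(p)}\, \mu_{jk}^{(q)} \sum_{r=1}^{R} \sum_{t=1}^{T^r} \gamma_t^r(j,k,e)$$

are used to determine transformation.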

6. An improved speech recognition system comprising: 

a speech recognizer; and 

a source normalization model derived by application of an expectation maximization algorithm.